⚡️ Speed up function `all_columns_match` by 147% #43

codeflash-ai · 2025-11-19T21:04:34Z

📄 147% (1.47x) speedup for `all_columns_match` in `datacompy/fugue.py`

⏱️ Runtime : 1.22 milliseconds → 496 microseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 146% speedup through two key optimizations that eliminate redundant computations:

1. Optimized unq_columns function:

Original: Created two OrderedSet objects and used set subtraction: OrderedSet(col1) - OrderedSet(col2)
Optimized: Creates only one set(col2) and uses list comprehension with membership testing: OrderedSet(c for c in col1 if c not in col2_set)
Why faster: Set membership testing (c not in col2_set) is O(1) on average vs. the overhead of creating multiple OrderedSet objects and performing set arithmetic

2. Completely reimplemented all_columns_match function:

Original: Called unq_columns() twice, effectively calling fa.get_column_names() four times total and performing complex OrderedSet operations
Optimized: Calls fa.get_column_names() only twice (once per dataframe) and directly compares set(col1) == set(col2)
Why faster: The line profiler shows fa.get_column_names() is expensive (~10ms per call). Reducing from 4 calls to 2 calls plus using simple set equality eliminates the computational overhead of OrderedSet operations entirely.

Performance impact: The profiler data shows the original all_columns_match spent 100% of its time calling unq_columns, which in turn spent 99.8% of its time in fa.get_column_names(). The optimized version eliminates half of these expensive calls and replaces complex OrderedSet arithmetic with fast set operations.

This optimization is particularly beneficial for workloads that frequently check column matching between dataframes, as it reduces both the computational complexity and the number of expensive external API calls.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	✅ 42 Passed
🌀 Generated Regression Tests	🔘 None Found
⏪ Replay Tests	✅ 10 Passed
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

⚙️ Existing Unit Tests and Runtime

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`test_fugue/test_duckdb.py::test_all_columns_match_duckdb`	159μs	62.8μs	154%✅
`test_fugue/test_fugue_pandas.py::test_all_columns_match_native`	181μs	70.4μs	158%✅
`test_fugue/test_fugue_polars.py::test_all_columns_match_polars`	215μs	91.7μs	135%✅
`test_fugue/test_fugue_spark.py::test_all_columns_match_spark`	213μs	66.5μs	221%✅

⏪ Replay Tests and Runtime

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_fugue_all_columns_match`	453μs	204μs	122%✅

To edit these changes git checkout codeflash/optimize-all_columns_match-mi6hq9u0 and push.

The optimized code achieves a **146% speedup** through two key optimizations that eliminate redundant computations: **1. Optimized `unq_columns` function:** - **Original**: Created two `OrderedSet` objects and used set subtraction: `OrderedSet(col1) - OrderedSet(col2)` - **Optimized**: Creates only one `set(col2)` and uses list comprehension with membership testing: `OrderedSet(c for c in col1 if c not in col2_set)` - **Why faster**: Set membership testing (`c not in col2_set`) is O(1) on average vs. the overhead of creating multiple OrderedSet objects and performing set arithmetic **2. Completely reimplemented `all_columns_match` function:** - **Original**: Called `unq_columns()` twice, effectively calling `fa.get_column_names()` four times total and performing complex OrderedSet operations - **Optimized**: Calls `fa.get_column_names()` only twice (once per dataframe) and directly compares `set(col1) == set(col2)` - **Why faster**: The line profiler shows `fa.get_column_names()` is expensive (~10ms per call). Reducing from 4 calls to 2 calls plus using simple set equality eliminates the computational overhead of OrderedSet operations entirely. **Performance impact**: The profiler data shows the original `all_columns_match` spent 100% of its time calling `unq_columns`, which in turn spent 99.8% of its time in `fa.get_column_names()`. The optimized version eliminates half of these expensive calls and replaces complex OrderedSet arithmetic with fast set operations. This optimization is particularly beneficial for workloads that frequently check column matching between dataframes, as it reduces both the computational complexity and the number of expensive external API calls.

codeflash-ai bot requested a review from mashraf-222 November 19, 2025 21:04

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up function `all_columns_match` by 147% #43

⚡️ Speed up function `all_columns_match` by 147% #43

Uh oh!

codeflash-ai bot commented Nov 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up function all_columns_match by 147% #43

Are you sure you want to change the base?

⚡️ Speed up function all_columns_match by 147% #43

Uh oh!

Conversation

codeflash-ai bot commented Nov 19, 2025

📄 147% (1.47x) speedup for all_columns_match in datacompy/fugue.py

📝 Explanation and details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up function `all_columns_match` by 147% #43

⚡️ Speed up function `all_columns_match` by 147% #43

📄 147% (1.47x) speedup for `all_columns_match` in `datacompy/fugue.py`